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Abstract 

Background: It has been widely realized that pathways rather than individual genes govern the course of 
carcinogenesis. Therefore, discovering driver pathways is becoming an important step to understand the molecular 
mechanisms underlying cancer and design efficient treatments for cancer patients. Previous studies have focused 
mainly on observation of the alterations in cancer genomes at the individual gene or single pathway level. However, a 
great deal of evidence has indicated that multiple pathways often function cooperatively in carcinogenesis and other 
key biological processes. 

Results: In this study, an exact mathematical programming method was proposed to de novo identify co-occurring 
mutated driver pathways (CoMDP) in carcinogenesis without any prior information beyond mutation profiles. Two 
possible properties of mutations that occurred in cooperative pathways were exploited to achieve this: (1 ) each 
individual pathway has high coverage and high exclusivity; and (2) the mutations between the pair of pathways 
showed statistically significant co-occurrence. The efficiency of CoMDP was validated first by testing on simulated 
data and comparing it with a previous method. Then CoMDP was applied to several real biological data including 
glioblastoma, lung adenocarcinoma, and ovarian carcinoma datasets. The discovered co-occurring driver pathways 
were here found to be involved in several key biological processes, such as cell survival and protein synthesis. 
Moreover, CoMDP was modified to (1) identif/ an extra pathway co-occurring with a known pathway and (2) detect 
multiple significant co-occurring driver pathways for carcinogenesis. 

Conclusions: The present method can be used to identify gene sets with more biological relevance than the ones 
currently used for the discovery of single driver pathways. 



Background 

The pathogenesis of cancer in humans is still poorly 
understood. To improve the diagnosis and treatment 
of cancer patients, several large-scale cancer genomics 
projects (e.g., the Cancer Genome Atlas (TCGA) [1], and 
International Cancer Genome Consortium (ICGC) [2]) 
have been performed in recent years. Analyzing these 
high-throughput data provides valuable opportunities to 
understand the formation and progression of cancer [3,4]. 

Generally, a large number of mutations occur in can- 
cer genomes (e.g., somatic mutations and copy number 
alterations (CNAs)). One crucial step in cancer research 
is to distinguish driver mutations and driver genes, which 
contribute to the progression of cancer from normal to 
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malignant states, from passenger mutations and passenger 
genes, which accumulate in cells but do not contribute to 
cancer development [5,6]. Most early efforts were devoted 
to the detection of individual driver genes based on recur- 
rent mutations of the genes in a large cohort of cancer 
patients [7]. 

Because of the mutational heterogeneity of cancer 
genomes, more attention has been paid to identify driver 
pathways and modules rather than individual genes in 
recent years [1,8,9]. It is noteworthy that most such meth- 
ods involve the use of prior knowledge about pathways 
and/or protein interaction networks. For example, known 
pathways were analyzed for enrichment of somatic muta- 
tions [1,8,9], or were examined to find which ones are 
significantly disturbed across many patients [10,11]. On 
the other hand, several studies indicated that driver path- 
ways often cover a large number of samples. More impor- 
tantly, mutations of the genes in one pathway usually 
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exhibit mutual exclusivity, i.e., a single mutation is usually 
enough to disturb one pathway [12,13]. These rules have 
been frequently used to detect driver pathways and mod- 
ules [14-16]. For example, Ciriello et al. proposed MEMo 
(Mutual Exclusivity Modules) to detect oncogenic net- 
work modules within a constructed network using gene 
mutation information and a human reference network 
(including protein interactions and signal transduction 
pathways) [14]. 

However, it is believed that the human protein inter- 
action network is incomplete. There are many unknown 
protein interactions and a great deal of knowledge about 
biological pathways remains unclear. Many reverse engi- 
neering approaches were developed in recent years to 
infer complex biological regulatory relationships. For 
example, Acharya et al proposed the Gene Set Gibbs 
Sampling (GSGS) method to reconstruct signaling path- 
way structures by sequentially inferring the information 
flows from the overlapping information flow gene sets 
[17]. It is necessary to develop new methods to discover 
mutated driver pathways or core modules without relying 
on prior information. Recently, Vandin et al proposed an 
approach, called Dendrix {de novo driver exclusivity), to 
de novo discover mutated driver pathways using somatic 
mutation data [18]. In this method, a novel weight func- 
tion was introduced by combining the coverage and exclu- 
sivity of the gene set. Maximization of the weight function 
is defined as the maximum weight submatrix problem. 
This was originally solved by the Markov chain Monte 
Carlo (MCMC) method [18], and was then addressed 
using an exact binary linear programming (BLP) model 
[19]. However, these studies for the identification of driver 
pathways or core modules have all focused on single path- 
ways or modules [15,16,18,19]. How various cellular and 
physiological processes are coordinately altered during 
the initiation and progression of cancer, it is still a major 
challenge. 

It is well known that multiple pathways with muta- 
tions are generally required for cancer [20]. In fact, it 
has been recently recognized that pathways often func- 
tion cooperatively in carcinogenesis [13,21-23]. Based on 
mutation data from COSMIC [24] and six major cancer- 
associated pathways from previous studies, Yeang et al 
demonstrated that there were significant combinatorial 
patterns of mutations occurring in the same patients (i.e., 
co-occurring), for which the corresponding genes usu- 
ally function in different pathways, whereas mutations in 
genes functioning in the same pathway are rarely mutated 
in the same sample (i.e., mutually exclusive) [13]. Cui etal 
identified 12 oncogene-signaling blocks from the inte- 
grated human signaling network [21]. They found that 
some of them (such as the RAS and TPS3 blocks in cen- 
tral nervous system, pancreas, skin, and blood tumors) 
would collaboratively promote cancer signaling and foster 



tumorigenesis. Using 18 pathways enriched with muta- 
tions in lung adenocarcinoma [8], Gu et al investigated 
pathway cooperation in cancer cells in terms of superpath- 
ways, which are clusters of co-disrupted pathways whose 
significance is tested by the hypergeometric model [25]. 
More recently, Gu et al devised a heuristic approach to 
detect cooperative functional modules in the glioblastoma 
multiforme (GBM) altered network which is obtained by 
mapping mutated genes onto a protein interaction net- 
work from the Pathway Commons database, and several 
pairs of significantly co-altered modules were identified 
which are involved in the main pathways known to be 
perturbed in GBM [26]. 

All these studies indicate that carcinogenesis is a com- 
plex process and the malignant transformation from a 
normal cell to a tumor is indeed a highly cooperative 
procedure involving synergy between pathways. There- 
fore, systematically exploring the complex collaboration 
among different biological pathways and functional mod- 
ules is a crucial step, which will shed new lights on 
our understanding of the cellular mechanisms underlying 
tumorigenesis. Current studies have mainly focused on 
the utilization of prior knowledge to determine whether 
two or more pathways or modules are simultaneously 
perturbed in the same samples. Considering the incom- 
pleteness of the knowledge about pathways and protein 
interaction networks, de novo discovery of collabora- 
tive pathways playing driver roles in cancer initiation 
and development is of pressing need. Although itera- 
tively performing Dendrix [18] or BLP [19] can obtain 
multiple pathways by removing the gene sets found in 
each previous iteration, however, such pathways are not 
guaranteed to be significantly co-disrupted in the same 
patients. 

In this study, a mathematical programming approach to 
discover co-occurring mutated driver pathways (CoMDP) 
in cancer generation and progression was developed. The 
co-occurring pathways detected here possess two prop- 
erties: first, each pathway is a set of mutated genes with 
high coverage and high exclusivity; second, the muta- 
tions between pathway genes exhibit a statistically sig- 
nificant co-occurrence in cancer samples. CoMDP is 
an exact method where the optimal set of pathways 
is obtained using an efficient algorithm. It does not 
require any prior information besides mutation pro- 
files. To evaluate this method, we first applied it onto 
simulated data and compared it with the original BLP 
method. Then we applied it onto four biological datasets 
and several pathways which might play collaborative 
roles in carcinogenesis were identified. For example, 
for the glioblastoma tumor data and lung adenocarci- 
noma data, several significant co-occurring gene path- 
ways were detected. Each pair interacts and regulates the 
cell survival and protein synthesis processes. In addition, 
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a modified form (named mod_CoMDP) was proposed 
in situations in which a certain pathway has been pre- 
viously proven to play important roles in some can- 
cers and one wants to know whether there are other 
pathways with cooperative effects in tumorigenesis. Fur- 
thermore, multiple co-occurring driver pathways can be 
discovered by combining previously detected pairs of 
gene sets and identifying others using mod_CoMDR 
When applied to the ovarian carcinoma dataset, CoMDP 
and/or mod_CoMDP identified driver pathways related to 
TP53 in the generation and progression of ovarian can- 
cer. In summary, we developed a method for identifying 
mutated co-occurring driver pathways which can enhance 
the understanding of molecular mechanisms underlying 
tumorigenesis. 

Methods 

Brief introduction of the maximum weight submatrix 
problem 

The Dendrix method was designed to de novo discover 
a single mutated driver pathway from somatic muta- 
tion data, where a weight function W was introduced 
by combining the coverage and exclusivity of the gene 
set [18]. Given a binary mutation matrix A with m rows 
(samples) and n columns (genes), the maximization of 
W was defined as the maximum weight submatrix prob- 
lem [18], which means to find a submatrix M of size 
m X /c in the mutation matrix A by maximizing the weight 
function W: 

w{M) = \r(M)\ - co(M) = 2\r(M)\ - J2 ir(^)l, (i) 

geM 

where r(g) = {i : Aig = 1} denotes the set of patients 
in which the gene g is mutated and r(M) = UgQM^(g)> 
|r(M)| measures the coverage of M and co(M) = 
J2geM \ ^(^)\ ~ I r (^) I measures the coverage overlap 
ofM. 

In addition to the stochastic MCMC search procedure 
[18], a BLP model has been introduced to solve this prob- 
lem exactly [19]. Inspired by the BLP model, a binary lin- 
ear programming model CoMDP to discover co-occurring 
driver pathways was developed here (Figure 1). The focus 
was placed on finding possible cooperative driver path- 
ways in carcinogenesis. For example, in Figure 1, the gene 
set D not C can be detected using MCMC or BLP for 
k = 2. This is because gene set D has higher mutation 
score W than that of gene set C. Mutations in the gene 
set B and the gene set C occurred simultaneously among a 
cohort of patients. CoMDP can successfully identify such 
co-occurring gene sets which may have been missed by 
previous approaches. 
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These two gene sets, each with approximate exclusivity, co-occur 
among the same patients. 


Figure 1 Schematic illustration of driver gene sets (pathways) 
and their co-occurrence in the mutation matrix. Three gene sets, 
B, C, and D, with high coverage and high exclusivity are shown. Gene 
sets B and C occur simultaneously among the same patients. 



CoMDP: a binary linear programming model for the 
identification of co-occurring driver pathways 

For the mutation matrix A, let us consider two sub- 
matrices M and N (which correspond to two gene sets 
or pathways S and T), Given the coverage r(M) and 
r(N) of the two gene sets (sometimes called individ- 
ual coverage in this study), we define (1) the common 
coverage c(M,N) = \r(M) C\r(N)\; (2) the union cov- 
erage b(M,N) = |r(M) U r(A^)|. We further define the 
non-shared coverage d{M,N) = b(M,N) — c(M,N), 
which describes the extent of the mutation co-occurrence 
between the two gene sets: the smaller the value d, the 
larger the co-occurrence is. As suggested before, (jo(M) 
and CO (N) reflect the exclusivity of M and N respectively. 

To identify co-occurring gene sets with large coverage 
and high exclusivity, we introduce the following weight 
function H: 

H(M, N) = c(M, N) - d(M, N) - co(M) - co (N) . (2) 

To maximize this weight function, a binary linear pro- 
gramming model is introduced as follows: 

mm m 

max G{x, y, z, u,v) = X Zi + rj ^^(»^/ + yi — 2zi) + Xi 

i=l i=l i=l 

n I m \ m 

j=i \ i=i I i=i 

n I m \ 

(3) 
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j=\ j=i 

n n 

\ E ^ij^j < J/ < E / = 1, . . . , m, 

j=i j=i 

s.t. { + J/ - 1 < ^^i < + yu / = 1, . . . , m, (4) 

+ V/ < 1, 7 = 1, . . . , 

n 

E(W; + Vy) 

Zh Ujy Vj G {0, 1}, / = 1, . . . , m, 7 = 1, . . . , 

where r/y and v, are indicators whether column ; of A falls 
into the submatrx M or N, so all the columns / s with 
Uj = 1 and V/ = 1 constitute M and N respectively; Xi 
and 3// are indicators whether the entries of row / of M 
and N are not all zeros, so E^i E/li J^' represent 

the coverage of M and A/^ (i.e., |r(M)| and |r(A/^)|) respec- 
tively; Zi is the indicator whether both Xi and equal to 
1, so E/ii represents the overlap between the coverage 
of M and (i.e., the common coverage c(M,N)). k is the 
total number of genes within S and T; and finally, A. and 
T] are two parameters controlling the common coverage 
c(M,N) and the non-shared coverage d{M,N) of the two 
gene sets. 

Note that E2.1 ^^d E2.i(^^' + " in model (3) 
are always nonnegative according to the constraints in (4). 
One can properly set A. and rj to be positive or negative to 
obtain gene sets with specific characteristics. For example, 
if A < 0 and > 0, the model tends to detect gene sets 
with large non-shared coverage but small common cov- 
erage under the maximization of G{x,y,z, u, v). Certainly, 
A > 0 and 77 < 0 must be set if one wants to identify 
co-occurring driver pathways by maximizing the function 
H in (2), which is the main focus of this study. More dis- 
cussion on the behavior of the model with A and r] can be 
referred to Simulation study below. 

mod_CoMDP: Finding a pathway that co-occurs with a 
Icnown one 

In some cases, some prior information is known for a 
disease. For example, a certain pathway may have been 
previously proven to play important roles in cancer. The 
problem is determining whether another pathway with a 
cooperative effect on tumorigenesis exists. CoDMP can 
be modified to answer this question to some extent. For 
a known pathway or a gene set C, a possible co-occurring 
pathway D can be identified by the following modified 
optimization problem: 



m m 



max GcCy, z, v) = A ^ z,- + ^{yt - 2zi) + 




i=l 



(5) 



^ E ^ij^j < J/ < E ^iJ^J> ^ = 1, . . . , m, 

Xi + 3// - 1 < 2zi < Xi + yi, / = 1, . . . , m, 
s.t. \ Uj + V/ < 1, ; = 1, . . . , 

n 

E = 

yi,Zi,Vj / = ; = 

(6) 



where Xi and Uj are indicators whether the entries of row / 
in gene set C are not all zeros and whether the gene corre- 
sponding to column ; of A falls into C, respectively; y^, z^, Vj 
and the parameters X and r] have the same meaning as in 
(3) and (4); r is the size of the desired gene set D, 

Generally, a branch-and-bound algorithm or others can 
be used to produce an optimal exact solution for CoMDP 
(also for mod_CoMDP). In this study an IBM ILOG 
CPLEX Optimizer was used to test the effectiveness of 
the model. The experiments were performed on a 2.50 
GHz Core i5-2520M CPU PC. For each given /c, CoMDP 
can automatically identify two gene sets when the sum of 
their sizes equals /c. Although the problem was NP-hard, 
it can still be solved efficiently due to the sparsity of the 
mutation matrix. 

Statistical significance 

A permutation test was used to assess the significance 
of the results. As in a previous study [18], the weight W 
in (1) served as a statistic to test the significance of the 
exclusivity and coverage of each identified gene set (called 
individual significance). We employed the co-occurrence 
ratio, which is defined as the ratio of the common cov- 
erage to the union coverage, as the statistic to test the 
significance of the co-occurrence of these two gene sets 
(called co-occurrence significance). 

Simulation data 

Three datasets were constructed to illustrate the proper- 
ties of the proposed method. The first set of simulated 
data, Sim_datal, was generated as in a previous study [19]. 
First, an empty m (samples) x n (genes) matrix was given 
[m = 500, n = 1, 000 were used). Then, gene sets Mi 
(/ = 1, • • • ,/; and each set has 10 genes) with a mutation 
probability pi were embedded in the matrix {pi = 1 — / • A, 
A = 0.05, and 7 = 10 were used here). For each sample, a 
gene uniformly chosen from Mi with pi was mutated, and 
once one gene was mutated, the other genes in Mi had a 
probability po to be mutated (here po = 0.04 was used). 
Finally, the genes not in Mi were mutated in at most three 
samples. The second dataset, Sim_data2, was generated 
using the strategy described above for noisy probability po 
from 0.04 to 0.24 in steps of 0.02. 
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The third dataset, Sim_data3, was generated as follows. 
Starting with an empty r x s matrix (here r = 600, s = 
1, 000 were used), we embedded / gene sets A^i, A^2> • • • 
(/ > 1, here / = 9 was used) into it. AT/ has size m/ x hq 
(in this study no = 5 and mi = [m/2] +2^~^ were used 
where [ m/2] denotes the integer part of m/2). Ni was con- 
structed according to the strategy like that for Mi stated 
above. Similarly, we mutated the genes not in Ni at most 
in three samples. 

We note that the average mutation rate for genes in a 
dataset in current simulation study is comparable to those 
of real datasets. For example, for Sim_data2 with the noisy 
probabilities of 0.04 and 0.24, each gene has an average 
mutation rate 0.0142 and 0.0274 respectively. For the four 
biological datasets (GBMl, GBM2, lung cancer and ovar- 
ian cancer) introduced in the following subsection, each 
dataset has an average mutation rate of 0.0658, 0.0416, 
0.0206, and 0.0134 for each gene, respectively. 

Biological data 

To assess the proposed methods for practical applica- 
tions, four biological datasets were collected due to their 
popularity and abundant prior knowledge. Note that the 
CoMDP can be easily applied to other cancer mutation 
datasets. 

The glioblastoma multiforme data 1 (GBMl) and the 
lung adenocarcinoma dataset were obtained directly from 
a previous study [18]. These sets contain mutations in 178 
genes across 84 GBM patients (samples) and 356 genes 
in 163 lung cancer patients, respectively. The GBM2 and 
ovarian carcinoma datasets were obtained from another 
previous study [16]. These sets contain CNAs for 1269 
genes spanning 169 GBM patients, somatic mutations for 
343 genes across 135 GBM patients, CNAs in 966 genes 
across 559 ovarian patients, and somatic mutations in 
8431 genes across 320 ovarian cancer patients. For the 
last two datasets, somatic mutations and CNAs were first 
integrated by merging the genes on the common patients. 
Finally, a binary mutation matrix A was obtained for each 
of the four datasets. The genes that are mutated in the 
same samples were combined into a gene set which was 
named as a metagene in this study. Note that the defini- 
tion of a metagene differs from that defined based on the 
matrix factorization method. 

Results and discussion 

Simulation study 

CoMDP can get the optimal solution of the original 
maximum weight submatrix problem. Like the BLP 
model [19], CoMDP can detect all the embedded gene 
sets in Sim_datal with k = 10, ^ = 1 and A. < 0 (here 
A. = — 10 was used). In the current situation, CoMDP usu- 
ally degenerates to find one gene set which corresponds to 
the optimal solution of BLP. Sometimes it can produce two 



gene sets with no common coverage. For example, with 
Sim_datal where ten gene sets were embedded in the 500 
(samples) x 1000 (genes) matrix, we identified two sets 
with two and eight genes respectively, which are mutated 
in 74 and 340 samples respectively. But their common cov- 
erage is 0. In fact, these two sets constitute one of the 
embedded gene sets which can be found using the BLP 
method, so they can be viewed as the same driver gene set 
as obtained using BLP directly. 

Note that as po increases, the exclusivity among the 
genes in Mi decreases, so the detection of the embed- 
ded gene sets Mi becomes more and more difficult. Let 
k = 10. CoMDP {X = -10, 7j = 1) and BLP were 
applied on Sim_data2. Both were able to precisely iden- 
tify all ten embedded gene sets when po < 0.10. For 
BLP, the average number of detected embedded gene sets 
decreased sharply as po increased. However, by properly 
choosing the parameter r], CoMDP can obtain more accu- 
rate and robust results than BLP at high values of po. For 
example, CoMDP with X = —10 and rj = 2 has much 
higher identification accuracy than BLP for po > 0.20 
(Figure 2A). 

CoMDP can identify co-occurring gene sets efficiently. 

We further applied CoMDP to Sim_data3 to demonstrate 
its effectiveness, and assessed the effect of X and rj. We 
found that the results are robust with the selection of these 
two parameters. CoMDP can always get the embedded 
gene sets with the largest co-occurrence ratio 0.8696 for 
X ranging from 6 to 24 in step of 2 and r] ranging from 
-10 to -1 in step of 1. The performance of CoMDP was 
also demonstrated on different r] with A. = 10 (Figure 2B). 
For example, when ^ < 3, the two detected 5-gene sets 
mutated almost in the same samples (the individual cov- 
erage of these two sets was 286 and 273 and common 
coverage was 260). When rj became larger the coverage 
difference of the two sets increased, and the common cov- 
erage became smaller. When r] > 12, CoMDP detected 
one gene set with 10 genes, which had the coverage 437 
and so the co-occurrence ratio was 0. Generally speak- 
ing, we found that CoMDP has similar performance when 
X equals 2 or 10 (Figure 2C). A high co-occurrence ratio 
(i.e., 0.8696) was obtained when rj < 0.5, and a ratio of 
0 was obtained when rj > 3.5. The simulation study con- 
firmed that A > 0 and rj < 0 are the proper selections for 
identifying co-occurring gene sets. In the following bio- 
logical applications, without loss of generality, A = 10 and 
rj = —2 were used. 

Applications to biological data 

In this section, CoMDP was used on four biological 
datasets (i.e., GBMl, lung cancer, GBM2, and ovarian 
carcinoma datasets) to identify the co-occurring driver 
pathways with /: = 4 ~ 10. We also demonstrated that 
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Figure 2 Results of CoMDP applied to simulation data. (A) Comparison of CoMDP with BLP for tine ability to identif/ tine embedded gene sets. 
Display of co-occurrence of tine gene sets detected by CoMDP for different ^ when A. > 0: (B) A. = 10, and (C) A. = 2. 



mod_CoMDP (model (5) and (6)) was applied onto the 
ovarian carcinoma data to detect more driver pathways 
co-occurred with TP53 in carcinogenesis and to find mul- 
tiple significant co-occurring driver pathways. Each run 
for GBMl and GBM2 datasets takes less than two sec- 
onds, and each run for the lung cancer dataset takes less 
than four seconds. 



GBMl dataset 

For k = 4, two gene sets were detected: {CDKN2A, 
MGi} and {MTAP, CYP27B1} {MGi is a metagene con- 
sisting of CDK4, FAM119B, MARCH9, TSFM, CENTGl, 
METTLl and TSPAN31) with individual significance pi = 
0.0207, p2 = 0.0058, co-occurrence significance pi^2 < 
0.0001, and co-occurrence ratio ri,2 = 0.9412 (Table 1). 
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Table 1 Co-occurring gene sets identified by applying CoMDP to GBM1 



k 


Gene set 1 


Gene set 2 


Pi 


P2 




"2 


^,2 


Pl,2 


4 


CDKN2A, MGi 


MTAPXYP27B1 


0.0207 


0.0058 


50 


49 


0.9412 


< 0.0001 




CDKN2A, TPS 3, 
















5 




CDKN2BXYP27B1 


0.0003 


0.0018 


68 


57 


0.7606 


< 0.0001 




CDKN2A, PTEK 


CDKN2B, TP53, 














6 


CYP27B1 




0.0002 


0.0003 


69 


71 


0.8182 


< 0.0001 




CDKN2A, PTEN, 


CDKN2B,RBl 














7 


CYP27B1 


TP53, MG^ 


0.0002 


0.0001 


69 


74 


0.8571 


< 0.0001 




CDKN2A, PTEN, 


CDKN2B,RBl 














8 


NFh CYP27B1 


MG],ERBB2 


0.0015 


< 0.0001 


72 


70 


0.8933 


< 0.0001 




CDKN2A, PTEN,NFl 


CDKN2B,RBl 














9 


CYP27BIKDR 


MG^,ERBB2 


0.0006 


< 0.0001 


74 


70 


0.9200 


< 0.0001 




CDKN2A, PTENXYP27BI 


CDKN2B,NFl RBI, 














10 


KDR, MGj 


MG^ , ERBB2 


< 0.0001 


< 0.0001 


72 


73 


0.9333 


< 0.0001 



Here pi and p2 are the p-values of the individual significance of two identified gene sets, pi,2 represents the p-value of their co-occurrence 
denote their respective coverage, and ri,2 is the ratio of the common coverage to their union coverage (i.e., co-occurrence ratio). There are 
following tables. MG^ is a metagene including seven genes: CDK4, FAMI 19B, MARCH9, TSFM, CENTGl, METTL1, TSPAN31. MGi is a metagene 
SLaA2,PAX6,ABCC4. 



significance, ni and rij 
same meanings in the 
including four genes: WTl, 



The two genes MTAP and CDKN2A were found to be 
frequently co-deleted [27,28]. They are both located on 
chromosome 9p21, a typical tumor suppressor region 
whose deletion is related to many different types of can- 
cers. CYP27B1 and the metagene MGi were mutated in 
the same patients with one exception: a single-nucleotide 
mutation was recorded in one additional patient for 
CYP27B1 (Figure 3A). Previous studies have suggested 
that CDK4 is the target of a common CNA in the 
corresponding patients [29]. Two protein products of 
CDKN2A, INK4A (also known as pl6) and ARF (also 
known as pl4), are involved in the pS3 and RB tumor 
suppressor pathways (Figure 3B). It has been shown that 
any error disrupting these pathways causes tumor forma- 
tion [30]. CDKN2A and CDK4 are considered part of the 
RB pathway Both MTAP and CYP27B1 encode impor- 
tant enzymes. The enzymes encoded by MTAP play a 
major role in polyamine metabolism and those encoded 
by CYP27B1 play a role in calcium metabolism and tissue 
differentiation. 

For k = 5, two gene sets including {CDKN2A, TP53, 
MGi} and {CDKN2B, CYP27B1} were detected with pi = 
0.0003, p2 = 0.0018, pi^2 < 0.0001 and ri,2 = 
0.7606. CDKN2B encoding INK4B (also known as plS) 
also locates in chromosome 9p21 homozygous deletion 
region, and CDKN2B is usually co-deleted with CDKN2A, 
This disrupts the pS3 and RB pathways. For this rea- 
son, combinatorial inactivation of CDKN2A and CDKN2B 
is frequently observed in these tumors. The cross-talk 
between the pS3 and RB pathways (Figure 3B) suggests 
that CDKN2A, TP53 and CDK4 are in the same pathway 
(Figure 3C(a) or 3C(b)). 



For k = 10, two gene sets {CDKN2A, PTEN, CYP27B1, 
KDR, MG2} and {CDKN2B, NFh RBI MGi, ERBB2} with 
Pi,p2 and pi;2 less than 0.0001 and ri,2 = 0.9333 were 
identified (Table 1 and Figure 4A). The first gene set was 
found to be involved in the pS3 and PI3K/Akt signaling 
pathways and the second in the RB and RTK/RAS/ERK 
signaling pathways. The RTK/RAS/PI3K signaling path- 
way can also be induced by the mutations in these 
two gene sets (Figure 4B). These pathways are impli- 
cated in biological processes associated with cell survival, 
cell cycle, protein synthesis, and cell proliferation. pS3, 
RB, and RTK/RAS/PI3K have been previously reported 
to contribute to GBM pathogenesis in original TCGA 
GBM studies [1]. Five well-known tumor suppressors 
(CDKN2A, CDKN2B, PTEN, NFh and RBI) are involved 
in these two gene sets. Besides the co-occurrence of 
CDKN2A and CDKN2B, NFl and RBI in the second 
gene set have exclusive mutations, which are co-occurrent 
with mutations of PTEN in the first set (Figure 4A). 
Recently, several studies have shown the cooperativity of 
tumor suppressors in carcinogenesis [33-35]. For example, 
Rahrmann et al demonstrated that co-occurring muta- 
tions in PTEN and NFl cooperate in the development of 
grade 3 PNSTs (peripheral nerve sheath tumors) in mice, 
suggesting that they may cooperate in human MPNST 
(malignant PNST) progression [33]. Another study by 
Chow et al [34] showed that cooperativity among PTEN, 
TPS3, and RBI can cause high-grade astrocytoma in 
mouse adult brain, in which the majority of glioblastomas 
arise. For another two almost simultaneously mutated 
oncogenes CYP27B1 and CDK4 (Figure 4A), Beckner etal, 
have demonstrated their cooperative amplification and 
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Figure 3 The results of CoMDP on the GBM1 dataset with /c = 4 and /c = 5, and the related p53 and RB pathways. (A) The co-occurrence 
between the two gene sets {CDKN2A, MG] } and {MTAP, CYP27B1} identified by CoMDP with /c = 4 in the GBMl dataset. (B) The TP53 and RB tumor 
suppressor pathways. and INK4A are two alternatively spliced transcripts of CD/(A/2/\.This figure was adapted/extracted according to [31,32]. (C) 
The alterative gene regulatory pathway ((a) or (b)) involving CDKN2A, CDK4 and TP53. 



co-expression for potential modulation of vitamin D in 
glioblastomas [36]. 

On the other hand, WTl encodes a transcription fac- 
tor that plays an essential role in cellular development and 
cell survival [37]. It regulates the expression of numer- 
ous target genes, including the famous tumor suppressor 
TP53 and the Wnt signaling pathway [38]. KDR encodes 
a VEGF (vascular endothelial growth factor) receptor. 
VEGF plays a crucial role in angiogenesis and progression 
of malignant brain tumors. ERBB2 encoding the protein 
HER2 (human epidermal growth factor receptor 2) is a 



member of the epidermal growth factor receptor {EGFR) 
family, and it has been shown to play an important role 
in the pathogenesis and progression of many different 
types of cancer. WTl, KDR and ERBB2 may drive the car- 
cinogenesis of GBM, indicating that CoMDP can identify 
low-frequency candidate driver genes that play important 
roles in cancer initiation and development. 

Lung cancer 

In this case, the significant results were obtained with 
k = 4, 5, 10 (Table 2). For k = 4 the co-occurring gene sets 
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Figure 4 The results of CoMDP on the GBMI dataset with k = ^0. (A) The co-occurrence between the two gene sets [CDKN2A, PTEN, CYP27BI 
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represents sometimes there is a synthetic lethal interaction between them. The green and blue nodes denote the genes involved in the two 
detected co-occurring gene sets. The regulatory relations are extracted from the KEGG database and related literature. 



are {ATM, TPS3} and {EGFR, KRAS}, and for k = 5 they 
are {ATM, TP53} and {EGFR, KRAS, STKll}, As stated 
in a previous study [18], ATM and TP53 interact directly 
and are involved in the cell cycle checkpoint control [39]. 
EGFR, KRAS and STKll are all involved in the regula- 
tion of the mTOR signaling pathway, whose dysregulation 
has been reported to be important to lung adenocarci- 
noma [8]. However, the gene set {ATM, TP53} can only 
be obtained by removing the mutations of {EGFR, KRAS, 
STKll} from the dataset in the previous study by Vandin 
[18]. Here, these two gene sets were identified simultane- 
ously and found to show significant co-occurrence. ATM, 
TP53 are involved in the regulation of cell apoptosis and 
EGFR, KRAS, STKll are related to protein synthesis, indi- 
cating that the cooperativity of these two processes for the 
generation and progression of lung cancer. 

For k = 10, {STKll, ATM, TP53, PAK4} and {KRAS, 
NTRK3, EGFR, GNAS, EPHA3, NRAS} were identified. All 
four genes STKll, ATM, TP53, PAK4 have been demon- 
strated to be closely related to the p53 signaling pathway 



in lung cancer [40,41]. Two members of the RAS sub- 
family, KRAS and NRAS function as binary molecular 
switches controlling the intracellular signaling networks 
that regulate several key cancer-related processes, such 
as proliferation, differentiation, cell adhesion, apoptosis, 
and cell migration. GNAS is a guanine nucleotide-binding 
protein (G protein). It acts as a modulator or trans- 
ducer in various transmembrane signaling systems. GNAS 
may interact with MDM2, which may lead to MDM2' 
mediated degradation of TP53, Solomon et al, found that 
many kinds of TPS3 mutations can regulate RAS in dif- 
ferent ways, inducing a cancer-related gene signature [42]. 
Kosaka et al. demonstrated that TP S3, EGFR and KRAS 
may cooperatively determine the prognosis of the patients 
in lung adenocarcinoma [43]. 

GBM2 dataset 

We observed that some new co-occurring gene sets were 
identified for GBM2 compared to GBMI (Table 3). For 
k = 9, we identified {CDKN2A, TPS3, MG3, PIK3R1, 



Table 2 Co-occurring gene sets identified by applying CoMDP to the lung cancer data 

k Gene set 1 Gene set 2 pi p2 n-\ 712 ri,2 Pi,2 

4 ATMJP53 KRAS, EGFR 0.0121 < 0.0001 76 90 0.3833 0.0129 

5 ATM, TP53 STKl 1, KRAS, EGFR 0.01 25 < 0.0001 76 110 0.4091 0.0220 
STKl 1, ATM, KRAS, NTRK3, EGFR, 

10 TP53,PAK4 GNAS, EPHA3, NRAS 0.0379 < 0.0001 98 108 0.5489 0.0001 
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Table 3 Co-occurring gene sets identified by applying CoMDP to GBM2 with somatic mutations and CNAs 



k 


Gene set 1 


Gene set 2 


Pi 


P2 






^,2 


Pl,2 


4 


CDKN2A, MG3 


CDKN2B,CYP27B1 


0.0045 


0.0056 


60 


63 


0.9219 


< 0.0001 




CDKN2A, 
















5 


CYP27B1XOL1A2 


CDKN2B, MG3 


0.0072 


0.0018 


62 


63 


0.9531 


< 0.0001 




CDKN2A, 


CDKN2B, 














6 


CYP27B1XOL1A2 


MG3,ERBB2 


0.0073 


0.0003 


62 


65 


0.9538 


< 0.0001 




CDKN2A, TP53, 


CDKN2B,RB1, 














7 


MG3, TAFl 


CYP27B1 


0.0001 


0.0001 


77 


70 


0.8375 


< 0.0001 




CDKN2A, TP53, 


CDKN2B,RB1, 














8 


MG3, TAFl 


CYP27B1, PRNP 


0.0002 


< 0.0001 


77 


71 


0.8500 


< 0.0001 




CDKN2A, TP53, TAFl, 


CDKN2B,RB1, 














9 


MG3, PIK3R1 


CYP27B1, SYNEl 


0.0002 


< 0.0001 


79 


72 


0.8642 


< 0.0001 




CDKN2A, TP53, TAFl, 


CDKN2B,RB1, SYNEl, 














10 


MG3, PIK3R1 


CYP27B1, MG4 


0.0001 


< 0.0001 


79 


73 


0.8765 


< 0.0001 



MG3 is a metagene including three genes: CDK4, MARCH9, TSPANSh MG4 is a metagene including 168 genes. 



TAFl} and {CDKN2B, CYP27B1, RBI, SYNEl} {MG3 is 
a metagene including CDK4, MARCH9 and TSPAN31) 
with pi = 0.0002, p2 < 0.0001, pi^2 < 0.0001 and 
ri,2 = 0.8642 (Figure 5A and Table 3). In addition to 
the cooperative effects between CDKN2A and CDKN2B, 
TP53 and RBI have been reported to have frequently co- 
occurring mutations related to several cancers, including 
the central nervous system tumor [13]. Recently, the col- 
laboration of TPS3 and CDKN2B was also studied with 



respect to cell apoptosis and aneurysm formation [44]. 
On the other hand, for the two detected low-frequency 
mutated genes TAFl (2/170) and SYNEl (3/170), TAFl 
encoding a transcription initiation factor phosphorylates 
TPS3 during Gl cell-cycle progression, so TAFl may be a 
member of the pS3 signaling pathway; SYNEl was found 
to be associated with the GBM patients' lifetime, and 
was therefore considered to be an important biomarker 
of glioblastoma survival [45]. Our studies indicated that 
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Figure 5 The results of CoMDP on the GBIVI2 dataset with k = 9. (A) The significant co-occurrence mutations between tine two gene sets 
{CDKN2A, TP53, MG3, PIK3R1, TAFl} and {CDKN2B, CYP27B1, RBI, SYNEl}. (B) The relevant pathways of the two gene sets. 
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the pS3, RB, and the P73/C-related signaling pathways may 
collaboratively contribute to carcinogenesis in GBM via 
combined genetic alterations (Figure 4B and Figure 5B). 

Ovarian cancer 

The mutation distribution among genes in the ovarian 
carcinoma data is quite nonuniform. Among all the 314 
samples, TPS3 is mutated in 251 of them and all the 
other genes are mutated in less than 26% of samples. This 
indicates that TP53 plays a crucial role in the carcinogen- 
esis of ovarian cancer {TTN was removed in the present 
analysis because of the possible artifacts of its mutations 
[46]). Determining whether there are other driver genes 
or pathways collaborating with TPS3 will be helpful for 
understanding the pathogenesis of this cancer. 

We applied CoMDP to the ovarian cancer data with 
k = 4i ^ 10 (Table 4). The first three rows in Table 4 
showed significantly co-occurring gene sets with TPS3, 
For k = 5 we identified {MYQ CCNEl, NINJ2, MG5} 
(MG5 is a metagene including CHKB and KLHDC7B), 
MYC and CCNEl are two important proto-oncogenes 
involved in cell cycle progression. The functional corre- 
lation of MYC and TP53 in the carcinogenic progression 
of ovarian carcinoma and other cancers have been eval- 
uated in several studies [47-49]. Recently, Kuhn et al 
discovered that molecular genetic aberrations of CCNEl 
together with those of the pS3 and PI3K pathways are 
major mechanisms in the development of uterine serous 
carcinoma [50]. Both CHKB and KLHDC7B are located 
on chromosome 22ql3.33, where KLHDC7B is involved 
in breast cancer and lymph node metastasis in cervical 



cancer and CHKB encodes choline kinase (ChoK) beta, de 
Molina et al demonstrated that ChoK acts as a link con- 
necting phospholipid metabolism and cell cycle regulation 
[51]. It is here supposed that TPS3 and CHKB may regu- 
late CDK4/6 collaboratively to suppress the progression of 
ovarian cancer. 

To identify other driver gene sets coupled to TPS3, we 
applied mod_CoMDP with r = 1 10 to the ovar- 
ian cancer data and significant results were obtained for 
r = 3 ~ 10 (Additional file 1: Table SI). For example, 
for r = 10 we identified {MYC, CCNEl, NINJ2, MGs, 
USH2A, NFl, HMCNl, ZNFS96, USP3S, MGe} {MGe is 
a metagene including four genes: STMN3, SLC2A4RG, 
ZGPAT, RTELl) with ra = 0.6563 (the co-occurrence 
ratio with TPS3). Frequent somatic mutations in NFl 
have been previously shown to co-occur with TPS3 muta- 
tions in ovarian carcinomas [52,53]. STMN3 and NFl 
have been demonstrated to be involved in the MAPK sig- 
naling pathway [19]. Furthermore, to discover possible 
collaborations of multiple driver pathways with TP53, we 
combined TP53 and the aforementioned 10-gene set into 
one nominal gene, which was considered mutated in a 
sample if both sets were mutated in that sample. Then we 
applied mod_CoMDP to identify gene sets significantly 
co-occurring with the nominal gene. For r = 1 ~ 10 
we identified PPP2R2A, which is generally implicated in 
the negative control of cell growth and division. Kalev 
et al revealed that PPP2R2A plays a critical role in DNA 
double-strand break repair through modulation of ATM 
phosphorylation [54]. Youn and Simon recently stud- 
ied mutator alterations relevant to ovarian cancer [55]. 



Table 4 Co-occurring gene sets identified by applying CoMDP to the ovarian carcinoma dadaset 



k 


Gene set 1 


Gene set 2 


Pi 


P2 






^1,2 


Pi, 2 


4 


TP53 


MYQ CCNEl NINJ2 


1 .0000 


< 0.0001 


251 


155 


0.4397 


0.0410 






MYQ CCNEl 














5 


TP53 


NINJ2, MGs 


1 .0000 


0.0100 


251 


169 


0.4894 


0.0250 






MYQ CCNEl 














6 


TP53 


NINJ2,ZNF596,USH2A 


1 .0000 


< 0.0001 


251 


183 


0.5228 


0.0120 






MYC, CCNEl 














7 


TP53, LYRM5 


NINJ2,ZNF596,U5H2A 


0.0270 


0.0030 


264 


183 


0.5629 


< 0.0001 






MYQCCNEl NINJ2, 














8 


TP53, LYRM5 


BRD4,ZNF596,USH2A 


0.0230 


0.0210 


264 


197 


0.6007 


0.001 






MYQ CCNEl NINJ2, 














9 


TP53, LYRM5 


BRD4,ZNF596, 

USH2AHMCN1 


0.0340 


0.0160 


264 


206 


0.6263 


< 0.0001 






MYQCCNEl NINJ2, 














10 


TP53, LYRM5 


NFIZNF596, 


0.0390 


0.001 


264 


211 


0.6493 


< 0.0001 



USH2AHMCNITPD52L2 



MG5 is a metagene including two genes: CHKB, KLHDC7B. For /c = 4 ~ 6 because of only one gene contained in the gene set 1 , the corresponding p-value equals 1 .0000. 
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Besides the well-known mutator gene TP53, they identi- 
fied PPP2R2A and the chromosomal region 22ql3.33 as 
the new mutator candidates. We find that these so called 
mutator genes, which increase genomic instability when 
altered, may be collaboratively involved in the processes 
of DNA synthesis and repair, chromosome segregation, 
damage surveillance, cell cycle checkpoints, and apopto- 
sis. The discovered driver patterns here may provide new 
information to enhance our understanding of the ovarian 
carcinoma pathogenesis, and further explorative analysis 
is needed to verify their biological relevance. 

Conclusions 

In this study, we proposed a method CoMDP for the 
de novo identification of co-occurring driver pathways in 
cancer. It considers two types of optimization simultane- 
ously: First, it makes the maximization of the weight W 
for each individual pathway, i.e., high coverage and high 
exclusivity. Second, it ensures that the maximization of the 
inter-overlap between the pathway pair. Simulation study 
indicated that for a range of values of the parameters k 
and rj, CoMDP can always get the exact solution. It was 
here demonstrated that CoMDP has the following char- 
acteristics: (1) It can identify individual driver gene sets 
as BLP [19] or Dendrix [18]. (2) It obtains more accurate 
and robust results when the noise increases. (3) It uses 
no prior information such as the incomplete knowledge 
about the pathways and protein interaction networks. (4) 
CoMDP is an exact method and the procedure is quite 
fast. 

When the project approximated to the end, we noticed 
that Leiserson et at. proposed a method for the simulta- 
neous identification of multiple driver pathways [56]. The 
present study is related to Leisersons but in many ways 
quite different. First, the so-called Dendrix/^p in [56] is the 
same as the BLP method [19]. Second, the weight function 
used by their Multi-Dendrix algorithm does not explicitly 
incorporate co-occurrence of mutations between genes 
in different pathways [56]. The Multi-Dentrix of the two 
gene sets was found to be a special form of the present 
model with A. = 2 and ?7 = 1. Our simulation study 
demonstrated that, in this case, the coverage of the two 
sets detected was 286 and 331 respectively. Although the 
union coverage got larger (i.e., 437), a lower co-occurrence 
ratio 0.4119 was obtained because of the smaller common 
coverage. This also indicates that the multiple driver path- 
ways (gene sets) identified by Multi-Dendrix cannot be 
guaranteed to be co-occurring. 

We note that the heterogeneity among tumors can affect 
the findings of the current method. Investigating combi- 
natorial patterns of driver pathways in different subtypes 
will be helpful for understanding the molecular mech- 
anisms of carcinogenesis and designing efficient treat- 
ments for cancer patients. It will be interesting to explore 



the effect of heterogeneity among tumors in the future 
studies. 

In summary, we have developed a method to iden- 
tify co-occurring driver pathways, which may reveal the 
functional cooperation of different driver pathways dur- 
ing carcinogenesis. The results of this study show that 
the present method will be a powerful tool to explore the 
collaborative effects among mutated driver pathways and 
enhance our understanding of the molecular mechanisms. 

Additional file 



Additional file 1 : The results by applying mod_CoMDP to the ovarian 
carcinoma dadaset. 
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